Today's session:

  • Cluster analysis (kmeans and hierarchical)
  • Reading raster files
  • Principal component analysis

Cluster analysis

Cluster analysis

Cluster analysis assignes each data point to one of k clusters, thereby maximmizing the between-cluster variance and minimizing the within-cluster variance.

There are two principal types of cluster algorithms: partionating algorithms (e.g. kmeans) and hierarchical algorithms.

kmeans

A kmeans cluster analysis can be performed using the kmeans() function:

data(iris)

iris_kmeans <- kmeans(x = iris[, c(1:4)], centers = 3)

kmeans

## K-means clustering with 3 clusters of sizes 38, 62, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.850000    3.073684     5.742105    2.071053
## 2     5.901613    2.748387     4.393548    1.433871
## 3     5.006000    3.428000     1.462000    0.246000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1
## [106] 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1
## [141] 1 1 2 1 1 1 2 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 23.87947 39.82097 15.15100
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

kmeans

We can plot the cluster solution using ggplot:

iris$cluster3 <- iris_kmeans$cluster

p <- ggplot(iris, aes(Sepal.Length, y = Sepal.Width, col = factor(cluster3))) +
  geom_point()

kmeans

kmeans

And compare it to the acutal species classification:

table(iris$cluster3, iris$Species)
##    
##     setosa versicolor virginica
##   1      0          2        36
##   2      0         48        14
##   3     50          0         0

kmeans

More/less clusters lead to different solutions:

iris$cluster2 <- kmeans(x = iris[, c(1:4)], centers = 2)$cluster
iris$cluster4 <- kmeans(x = iris[, c(1:4)], centers = 4)$cluster
iris$cluster5 <- kmeans(x = iris[, c(1:4)], centers = 5)$cluster
iris$cluster6 <- kmeans(x = iris[, c(1:4)], centers = 6)$cluster

p <- iris %>%
  gather(key = k, value = cluster, -(Sepal.Length:Species)) %>%
  ggplot(., aes(Sepal.Length, y = Sepal.Width, col = factor(cluster))) +
  geom_point() +
  facet_wrap(~k, ncol = 6)

kmeans

Hierarchical clustering

Hierarchical clustering can be performed using the hclust() function. However, we first need to calcuate the distances (similarities) between each data point pair:

dist <- dist(iris[, 1:4])

There are several methods for calculating the distance (e.g., euclidean, manhatten, …), see ?dist(). With the distance matrix we can hierarchically cluster all data points:

iris_hc <- hclust(dist)

And plot i using the ggdrndro package:

library(ggdendro)

p <- ggdendrogram(iris_hc)

See https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html for further information.

Hierarchical clustering

Reading raster files

Reading raster files

For todays exercise we are going to use data that is stored in a raster stack, that is a stack of gridded rasters with matching spatial resolution and extent.

Reading a raster stack can be done using the stack() function in the raster package:

library(raster)

clim <- stack("Data/climate.envi")

Raster stacks can be of any format (geoTiff, ENVI, grid, bsq, etc.) and with or without spatial reference. For reading a single raster instead of a stack, you can use the raster() function.

Reading raster files

clim
## class       : RasterStack 
## dimensions  : 709, 1306, 925954, 4  (nrow, ncol, ncell, nlayers)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : 16.90833, 27.79167, 44.33333, 50.24167  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
## names       :       Band.1,       Band.2,       Band.3,       Band.4 
## min values  :   0.08333333, -52.83333333, -26.41666667,  38.41666667 
## max values  :     173.2500,      77.2500,     124.6667,     146.6667
names(clim) <- c("Tmin", "Tmean", "Tmax", "Precip")

Reading raster files

Plotting the raster is easily achieved by the plot() command. However, you have to select a layer for plotting:

plot(clim, y = 1)

Reading raster files

Reading raster files

Rasters can have millions of pixels, which can make computations very slow. To deal with this issue, we are going to sample 10,000 pixels from the raster:

clim_samp <- sampleRandom(clim, size = 10000)
clim_samp <- as.data.frame(clim_samp) # sampleRandom returns a matrix

clim_samp[1:5, ]
##       Tmin    Tmean      Tmax   Precip
## 1 158.8333 54.75000 106.50000 49.25000
## 2 119.9167 38.41667  79.00000 54.33333
## 3 131.6667 36.41667  83.75000 57.91667
## 4 141.5833 50.58333  95.91666 57.33333
## 5 104.9167 21.25000  62.91667 61.75000

Reading raster files

library(GGally)

p <- clim_samp %>%
  sample_n(., size = 100) %>%
  ggpairs(.)

Reading raster files

Principle Component Analysis

Principle Component Analysis

There are two functions in R for performaing a PCA: princomp() and prcomp(). While the former function calculates the eigenvalue on the correlation/covariance matrix using the function eigen(), the latter function uses singular value decomposition, which numerically more stable. We here are going to use prcomp().

Principle Component Analysis

pca <- prcomp(clim_samp, scale. = TRUE)

You should set the scale. argument to TRUE, to scale the data matrix to unit variances. The default is scale. = FALSE. You can also use scale() to standardize the original data matrix and then enter the standardized data matrix to prcomp.

Principle Component Analysis

The prcomp object stores the rotations (eigenvectors) which define the contribution of each variable to a component:

pca$rotation
##               PC1        PC2          PC3           PC4
## Tmin   -0.5277307 -0.2333982  0.679557980 -4.530193e-01
## Tmean  -0.5274517 -0.2265064 -0.732016319 -3.669355e-01
## Tmax   -0.5324369 -0.2324769  0.048308505  8.124849e-01
## Precip  0.3997574 -0.9166104 -0.004398646 -4.017791e-05

Principle Component Analysis

The eigenvalues are also stored in the prcomp object:

pca$sdev
## [1] 1.850517965 0.734087442 0.191547478 0.002907044

And can be easily translated into variance explained:

var_exp <- data.frame(pc = 1:4,
                      var_exp = pca$sdev^2 / sum(pca$sdev^2))

var_exp$var_exp_cumsum <- cumsum(var_exp$var_exp)

p <- ggplot(var_exp) +
  geom_bar(aes(x = pc, y = var_exp), stat = "identity") +
  geom_line(aes(x = pc, y = var_exp_cumsum))

Principle Component Analysis

Principle Component Analysis

We can now apply the PCA transformation to the whole raster stack:

clim_pca <- raster::predict(clim, pca, index = 1:4)

Principle Component Analysis

Principle Component Analysis

Principle Component Analysis

Principle Component Analysis

Now the magic!

Let's combine PCA and kmeans:

pca_values <- as.data.frame(clim_pca)
pca_kmeans <- kmeans(pca_values[, 1:3], centers = 5)
kmean_raster <- raster(clim_pca)
values(kmean_raster) <- pca_kmeans$cluster
plot(kmean_raster)

Now the magic!